(57) Abstract

Plural, arbitrarily-shifted, pseudo-random bit streams are generated from a single linear feedback shift register (LFSR)
(201). Each bit stream is obtained by tapping the outputs of selected LFSR cells (202) and feeding these tapped cell outputs
through a set of exclusive-OR gates (206). The taps are selected in order to achieve the desired shift between bit streams.
In addition, the tap patterns can be selected so that the number of inputs (fan-in) to each bit stream is within predetermined
bounds and that the number of taps per cell (cell load) is within predetermined bounds. A disclosed computer program generates the
tap patterns as a function of the number of cells and the structure of the LFSR, the number of output bit streams, the maximum
allowed shift variation of the bit streams, and the bounds on fan-in and cell load. Each pseudo-random bit stream serves as an
input to a low-pass filter which produces an essentially Gaussian noise output. The plural noise outputs are relatively
uncorrelated and can be used in a parallel stochastic learning neural network for purposes such as annealing.

GENERATOR OF MULTIPLE UNCORRELATED NOISE SOURCES

BACKGROUND OF THE INVENTION

A portion of the disclosure of this patent document 
contains material which is subject to copyright protection. The copyright 
owner has no objection to the facsimile reproduction by anyone of the
patent document or the patent disclosure, as it appears in the Patent and 
Trademark Office patent file or records, but otherwise reserves all copyright 
rights whatsoever. 

In a prior art neural network test chip, a stochastic
learning technique with a local learning rule was implemented in VLSI
(see, for example, U.S. Patent No. 4,874,964, issued October 17, 1989 to
J. Alspector and R. B. Allen; J. Alspector and R. B. Allen, "A
neuromorphic VLSI learning system," in Advanced Research in VLSI:
Proceedings of the 1987 Stanford Conference, P. Losleben, Ed. Cambridge,
MA: MIT Press, pp. 313-349, 1987; J. Alspector, R. B. Allen, V. Hu, and
S. Satyanarayanna, "Stochastic learning networks and their electronic
implementation," Proceedings of the Conference on Neural Information
Processing Systems, Denver, CO, pp. 9-21, Nov. 1987, D. Anderson, Ed.
New York, NY: Am. Inst. of Phys., 1988; and J. Alspector, B. Gupta, and
R. B. Allen, "Performance of a stochastic learning microchip," in Advances
in Neural Information Processing Systems 1, Denver, CO, pp. 748-760,
November 1988, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann,
1989). The Boltzmann algorithm (D. H. Ackley, G. E. Hinton, and
T. J. Sejnowski, "A learning algorithm for Boltzmann machines," Cognitive
Science 9, pp. 147-169, 1985) depends on the stochastic settling of the neural
system using the process of simulated annealing (S. Kirkpatrick,
C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing,"
Science, 220, pp. 671-680, 1983) to avoid local minima in the energy
function that describes its evolution. In the aforenoted prior art neural
network prototype test chip, highly amplified Gaussian thermal noise
generated by electrons in a transistor was used for annealing. Each neuron
was fed by a separate thermal noise generator, so that its state would be
unaffected by the noise seen by the others.

Neural learning algorithms such as this capture 
correlations seen by neural states to perform classification based on input 
data. For local learning rules, stochastic elements are necessary for, among
other reasons, performing unbiased averaging over neural states elsewhere
in the network. Correlations in the noise they see would cause errors in the
learning since these undesired correlations would be captured by the learning
rule. Other reasons for stochastic elements in neural networks include the
search of a large solution space, helping a network settle while avoiding
local minima, and interpolating between discrete values of weights by time
averaging. 

Although a thermal noise generator seems simple and
unbiased, it has implementation problems. In particular, it exacts a
substantial area penalty and, in fact, occupies much more area than the
neuron itself. More significantly, the large gain needed to amplify thermal
noise can lead to cross coupling of the on-chip amplifiers, thereby frustrating
the original purpose of using separate noise amplifiers to obtain zero cross
correlation. Despite this, the small network on the prior art test chip
demonstrated satisfactory learning for small problems. To scale this
network to larger size, it would have to be sensitive to more subtle
correlations, and therefore the noise sources must show minimal correlation.

A linear feedback shift register (LFSR) produces a
pseudo-random bit stream (PRBS) that can be used to make an analog noise
source. The PRBS is processed by a low-pass filter with cutoff frequency
just a few percent of the clock frequency. This has the effect of performing
a time integration over many bits. If each bit's value is randomly distributed
with a probability of 0.5 for 0 or 1, then the value of this integration follows
a binomial distribution that approaches a Gaussian distribution for a large
number of bits. This creates a Gaussian analog pseudo-random noise source
whose statistical properties are similar to the thermal noise which is to be
modeled with a simulated annealing technique. Variable amplifiers with
gains low enough to avoid coupling problems are then sufficient to perform
the annealing process. An N-stage LFSR creates a PRBS of maximal length,
$2^N - 1$, when the feedback taps are chosen appropriately. One useful
property of such a PRBS is that it has cross correlation $-1/(2^N - 1)$
(effectively negligible) with a time-shifted version of itself, assuming the
cross correlation is calculated after replacing each 1 of the binary bit stream
with -1 and each 0 with 1 (see, for example, S. W. Golomb, Shift Register
Sequences, revised ed. Laguna Hills, CA: Aegean Park Press, 1982). For
neural network purposes, this time shift must be large enough for the
network to settle sufficiently to "forget" the sequence during the anneal
cycle before it sees another version of it later. In practice, this is obtained
easily with relatively small shift registers because the length of the sequence
grows exponentially with the shift register size.
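
By way of illustration only, the cross correlation property stated above can be checked numerically with a short Python sketch. The register size (N = 4), the feedback cells (0 and 3), the integer state encoding, and the function names `prbs` and `cross_correlation` are assumptions made for this example and are not taken from the disclosure.

```python
def prbs(n, feedback_cells, length, state=1):
    """Return `length` bits taken from cell 0 of an n-cell LFSR.

    Assumed encoding: bit i of the integer `state` is the output of cell i.
    """
    mask = sum(1 << i for i in feedback_cells)
    bits = []
    for _ in range(length):
        bits.append(state & 1)                    # the PRBS is the output of cell 0
        fb = bin(state & mask).count("1") & 1     # modulo 2 sum of the fed-back cells
        state = (state >> 1) | (fb << (n - 1))    # shift right, feed into cell n-1
    return bits


def cross_correlation(bits, shift):
    """Normalized cyclic correlation of a sequence with a shifted copy of itself."""
    p = len(bits)
    signed = [-1 if b else 1 for b in bits]       # 1 -> -1 and 0 -> +1, as in the text
    return sum(signed[k] * signed[(k + shift) % p] for k in range(p)) / p


N = 4                                             # illustrative size only
seq = prbs(N, (0, 3), 2**N - 1)                   # one full period of 15 bits
for shift in (0, 1, 5):
    print(shift, cross_correlation(seq, shift))   # 1.0 at shift 0, -1/15 otherwise
```

For a register of practical size the same check gives $-1/(2^N - 1)$ at every nonzero shift, which is why the shifted streams can be treated as essentially uncorrelated.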

This shifting could be accomplished by using a
collection of identical LFSRs, one per neuron. Each would be loaded with a
specified initial state to obtain a desired shift relative to the other LFSRs.
All LFSRs would be clocked simultaneously. The overhead of such an
approach, however, is unacceptable. For instance, a single 25-stage shift
register (with a maximal period of 34 million clock cycles) would require
approximately 400,000 square microns in 2 micron CMOS technology, which
is considerably larger even than the thermal noise amplifier of the prior art
implementation in the same technology.

Various techniques for generating plural PRBS have
been reported. For example, P. D. Hortensius, R. D. McLeod, W. Pries,
D. M. Miller, and H. C. Card describe "Cellular automata-based
pseudorandom number generators for built-in self-test," in IEEE Trans.
Computer-Aided Design, vol. 8, no. 8, pp. 842-859, Aug. 1989. As disclosed
therein, cellular automata are employed to generate pseudo-random bits in
parallel. W. J. McFarland, K. H. Springer, and C.-S. Yen describe a
"1-Gword/s pseudorandom word generator," in IEEE J. Solid-State Circuits, vol.
24, no. 3, pp. 747-751, June 1989. This pseudorandom word generator uses
a feedback/feedforward technique with exclusive-OR gates at each shift
register stage. This technique requires at least as many shift register stages
as outputs. A wideband digital pseudo-Gaussian noise generator is disclosed
in U.S. Patent No. 3,747,381, issued June 26, 1973 to W. J. Hurd. This
noise generator requires at least two feedback shift registers of relatively
prime lengths. Disadvantageously, in all these prior art noise and/or PRBS
generators, the number of cells required increases linearly with the number
of required bit streams, P.

An object of the present invention is to reduce to a
minimum the hardware necessary to generate multiple pseudo-random noise
sources required for annealing in neural networks.

An additional object of the present invention is to
amortize the space required for a single generator of plural noise sources
amongst many neurons in a neural network so that an acceptably small area
overhead for VLSI implementations results.

SUMMARY OF THE INVENTION

In accordance with the present invention, a single
maximal length linear feedback shift register is used to generate multiple,
arbitrarily-shifted, pseudo-random bit streams. Each bit stream is converted
to an analog noise source by filtering. In particular, each bit stream is
obtained by tapping the outputs of selected LFSR cells and feeding these
tapped cell outputs through a parity tree consisting of exclusive-OR gates.
In accordance with the invention, the particular cells of the LFSR tapped to
form each bit stream are selected to meet certain constraints. In particular,
the taps are chosen so that: (1) the shift variation between bit streams is
within a set limit; (2) each cell is tapped to provide an input to no fewer
than and no greater than preset numbers of bit streams; and (3) each bit
stream is formed from no fewer than and no greater than preset numbers of
cell outputs.

An advantage of the present invention is that the
number of cells needed to produce P bit streams grows as log(P) rather
than linearly with P.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a schematic diagram of a conventional prior 
art linear feedback shift register used to make an analog noise source; and 

FIG. 2 is a schematic diagram of a single linear 
feedback shift register used to generate multiple pseudo-random bit streams
in accordance with the present invention.

DETAILED DESCRIPTION

With reference to FIG. 1, the single N-stage LFSR 101,
also denoted Y in the equations derived hereinbelow, consists of N clocked
D-type flip-flops 102-(N-1) through 102-0. The N stages, also called cells, are
arrayed horizontally with the shift direction from left to right, i.e., the input
of every cell except the leftmost cell is connected directly to the output of
the cell on its left. The cells are numbered consecutively from (N - 1) to 0,
with the (N - 1)th cell, 102-(N-1), on the left and the zeroth cell, 102-0, on
the right. The signal fed to the D input of the (N - 1)th (leftmost) cell,
102-(N-1), is obtained from the feedback function H. This is the modulo 2 sum of
the outputs belonging to a subset of the N cells, that is,

$$H \triangleq \sum_{i=0}^{N-1} c_i z_i \pmod 2 \qquad (1)$$

where $\triangleq$ denotes "is defined to be equal to," $z_i$ is the output of cell i, and
each feedback coefficient $c_i$ is either 0 or 1. In the embodiment of FIG. 1, $c_0$
and $c_3$ equal 1 and the other $c_i$ equal 0. These are just chosen for
illustration and in reality would be determined as a function of N and the
primitive polynomial thereof, to be defined hereinbelow. Exclusive-OR
gate 103 forms the modulo 2 sum of the two fed back outputs of cells 102-0
and 102-3. The output of gate 103 provides the D input to cell 102-(N-1).

To shift register Y 101 means to apply one or more
clock pulses simultaneously to the CK clock inputs of the cells of Y 101. The
clock is not shown. The PRBS generated by Y is, by definition, the sequence
of bits generated by the zeroth (rightmost) cell, one bit per clock cycle, as Y
is shifted. The sequence of states that Y evolves through as it is shifted is
determined by the initial state and the feedback function H. Thus, the
PRBS for a given LFSR depends on its initial state. If Y sequences through
all possible nonzero states whenever it starts in a nonzero initial state, Y is
said to be maximal. Maximality occurs only for certain choices of the
feedback coefficients $c_i$, namely, if the polynomial $c(x)$, where $c(x)$ is
defined by the expression

$$c(x) \triangleq x^N + \sum_{i=0}^{N-1} c_i x^i \qquad (2)$$

is primitive in GF($2^N$), where GF($2^N$) denotes the Galois field with $2^N$
elements. A PRBS generated by a maximal N-stage LFSR starting in a
nonzero initial state is called an N-maximal PRBS. Some straightforward
implications of maximality include (a) an N-maximal PRBS has period
$2^N - 1$, and (b) every possible combination of N consecutive bits, except the
all-zero combination, occurs somewhere in an N-maximal PRBS.
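
As a hedged illustration of maximality, the following Python sketch steps a small register through its states and tests whether all $2^N - 1$ nonzero states are visited. The function names `state_cycle` and `is_maximal`, the integer state encoding (bit i holds the output of cell i), and the 4-cell examples are assumptions chosen for the example only.

```python
def state_cycle(n, feedback_cells, state=1):
    """List the states visited before the register returns to its initial state."""
    mask = sum(1 << i for i in feedback_cells)
    start, states = state, []
    for _ in range(2**n):                         # no cycle can be longer than this
        states.append(state)
        fb = bin(state & mask).count("1") & 1     # feedback function H
        state = (state >> 1) | (fb << (n - 1))    # one clock pulse
        if state == start:
            break
    return states


def is_maximal(n, feedback_cells):
    """True if the register steps through every one of the 2**n - 1 nonzero states."""
    return sorted(state_cycle(n, feedback_cells)) == list(range(1, 2**n))


print(is_maximal(4, (0, 3)))   # True:  feedback from cells 0 and 3
print(is_maximal(4, (0, 2)))   # False: this choice gives a cycle of only 6 states
```

Because each state holds N consecutive bits of the PRBS, visiting every nonzero state is the same as property (b): every nonzero combination of N consecutive bits occurs in the sequence.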

In the prior art analog noise generator in FIG. 1, the 
pseudo-random bit sequence is taken at the Q output of the rightmost cell,
102-0. This digital bit sequence on lead 104 is processed by a low pass filter
having a cutoff frequency just a few percent of the clock frequency, and
consisting of resistor 105 and capacitor 106. An essentially Gaussian analog
pseudo-random noise source is thus created at output 107. 

With reference to FIG. 2, a single maximal length
LFSR 201 is used to derive plural pseudo-random bit streams. As in the
prior art described hereinabove, LFSR 201 consists of N cells, 202-(N-1) through
202-0. Feedback is provided to the D input of cell 202-(N-1) as determined by a
primitive polynomial of the N-stage register. As in the prior art structure,
feedback is provided in this illustrative example from the Q outputs of the
0th and 3rd cells, 202-0 and 202-3, respectively, which are modulo 2
summed by exclusive-OR gate 203. As above, these particular cells are
selected just for purposes of illustration.

It has been determined and mathematically proven by
the inventors herein that, by tapping and modulo 2 combining the outputs of
particularly selected cells of the maximal length LFSR, shifted versions of
the basic bit pattern can be generated. By proper selection of the cells
tapped, in fact, any of the $2^N - 1$ possible shifts can be generated. If the
shifts are sufficiently far apart, each combination of cell outputs can serve as
a separate source of noise that is essentially uncorrelated with the other
sources generated from the same LFSR. It is thus necessary to know which
cells to tap to generate the plural bit streams that are shifted sufficiently
apart to ensure low correlation. As will be described, the cells tapped can
be chosen such that, in addition to meeting a shift constraint, other
constraints can be met that affect the physicality of a VLSI implementation.
Advantageously, the number of bit streams that can be generated from the
single LFSR is not limited to the number of cells in the shift register.

In the purely illustrative example in FIG. 2, P sources
of random Gaussian noise are generated. As just noted, these noise sources
are generated from P pseudo-random bit streams, which are shifted versions
of each other, by modulo 2 combining the Q outputs of selected cells in the
register. In the illustrative example of FIG. 2, the first bit stream on lead
205-1 is produced from the modulo 2 combination of the Q outputs of cells
202-0, 202-1, and 202-3, which are combined by exclusive-OR gates 206-1,1
and 206-1,2. The bit stream on lead 205-1 is low-pass filtered by the RC
filter, consisting of resistor 207-1 and capacitor 208-1, to produce the
random noise source on lead 209-1. The other bit streams on leads 205-j,
for $2 \le j \le P$, are similarly produced by modulo 2 combining, through
exclusive-OR gates 206-j,1 and 206-j,2, the outputs of selected ones of the cells.
The resultant bit stream is then filtered through a low-pass filter consisting
of resistor 207-j and capacitor 208-j, to produce random noise at output
209-j. In this illustrative example, each output bit stream is generated from
three cell outputs. As will be noted hereinafter, the minimum and
maximum number of cells needed to be tapped to form any of the bit
streams from the LFSR, called the minimum and maximum allowed fan-in,
respectively, is a factor that can be controlled in selecting the tap patterns.
Also, the minimum and maximum number of taps on any one cell in the
LFSR, called the minimum and maximum allowed cell load, respectively, is
controllable.

In what follows, an algorithm for determining the taps 
will be provided. First, however, a mathematical foundation will be
presented for the technique of the present invention. Two lemmas that are
keys to the technique of the present invention for generating multiple bit
streams from a single LFSR will be proven. The first lemma loosely says
that the bit stream obtained from the modulo 2 combination of the outputs
of the cells of a maximal LFSR gives a shifted version of the basic LFSR bit
stream. The second lemma says that any desired shift can be obtained by 
appropriate choice of the taps. 

Preceding the rigorous mathematical foundation, let $A_0$
denote an N-maximal PRBS. For each nonnegative integer k, let $a_k \in \{0, 1\}$
denote the value of the sequence $A_0$ at clock cycle k. This is indicated with
the notation $A_0 = \{a_0, a_1, a_2, \ldots\}$. For every positive integer m, define
$A_m \triangleq \{a_m, a_{m+1}, a_{m+2}, \ldots\}$. Note that $A_m$ is obtained from $A_0$ by shifting
forward in time by m clock cycles. Finally, for a given nonzero initial state
of Y, let S denote the set containing the all-zero sequence along with the
shifted sequences $A_m$, where $0 \le m \le 2^N - 2$. Lemma 1, which says, in
general terms, that the bitwise exclusive-OR of two shifted versions of a
given N-maximal PRBS generates another shifted version of the same PRBS,
can now be stated:

Lemma 1: Let m and n be nonnegative integers. Let
$B \triangleq \{b_0, b_1, b_2, \ldots\}$ denote the sequence obtained by a bitwise
exclusive-OR of $A_m \in S$ and $A_n \in S$, that is, $b_k \triangleq a_{k+m} + a_{k+n} \pmod 2$. Then
$B \in S$.

Proof: $A_0$ is generated by a recursion relation of the
form

$$a_{k+N} = \sum_{i=0}^{N-1} c_i a_{k+i} \pmod 2 \qquad (3)$$

where the feedback coefficients $c_i$ are either 0 or 1. Clearly, $A_m$ and $A_n$ also
satisfy this recursion relation. Since Eq. (3) is linear, B (which equals the
bitwise modulo 2 sum of $A_m$ and $A_n$) satisfies Eq. (3) as well. Thus, the
entire sequence B is determined by its first N bits. Suppose m and n are
equal modulo $2^N - 1$. Then $A_m$ and $A_n$ are identical sequences; thus, B is the
all-zero sequence and is therefore a member of S. Now suppose m and n are
unequal modulo $2^N - 1$. Then $A_m$ and $A_n$ are not identical sequences and B
is not the all-zero sequence. In particular, the first N bits of B cannot all be
zero (otherwise, Eq. (3) would imply that B is the all-zero sequence). Since
$A_0$ is an N-maximal PRBS, all possible combinations of N consecutive bits
except the all-zero combination must occur in $A_0$. Thus, there must be some
nonnegative r such that the first N bits of $A_r$ equal the first N bits of B.
Since $A_0$ is periodic with period $2^N - 1$, there is no loss of generality in
assuming $r < 2^N - 1$. Thus $B = A_r \in S$. Q.E.D.

Lemma 1 is a special case of the more general Abelian 
group property of S under bitwise modulo 2 addition. 
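
Lemma 1 can be spot-checked on a small register. The Python sketch below is illustrative only: the 4-cell register with feedback from cells 0 and 3, the integer state encoding, and the helper names `prbs`, `shifted`, and `find_shift` are assumptions, not part of the disclosure.

```python
def prbs(n, feedback_cells, length, state=1):
    """Bits from cell 0 of an n-cell LFSR (bit i of `state` = output of cell i)."""
    mask = sum(1 << i for i in feedback_cells)
    bits = []
    for _ in range(length):
        bits.append(state & 1)
        fb = bin(state & mask).count("1") & 1
        state = (state >> 1) | (fb << (n - 1))
    return bits


N = 4
p = 2**N - 1
a0 = prbs(N, (0, 3), p)                 # one period of A_0


def shifted(m):
    """One period of A_m, i.e. A_0 shifted forward in time by m clock cycles."""
    return [a0[(k + m) % p] for k in range(p)]


def find_shift(seq):
    """Return r such that seq equals A_r, or None for the all-zero sequence."""
    for r in range(p):
        if seq == shifted(r):
            return r
    return None


b = [x ^ y for x, y in zip(shifted(1), shifted(3))]   # bitwise XOR of A_1 and A_3
print(find_shift(b))                                   # 10, so B = A_10 is in S
```

The exhaustive check over all pairs of distinct shifts takes only a few hundred comparisons for N = 4, and every exclusive-OR lands on some shifted sequence, as the lemma requires.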

A pair of taps from an LFSR gives two particular
shifted sequences from a restricted set. Their exclusive-OR gives a third
sequence by Lemma 1. This sequence, in turn, can be exclusive-OR'ed with
another tap to give still other shifted versions of the main sequence.
Lemma 1 thus implies that given a maximal LFSR generating a PRBS, the
outputs of a collection of cells of this LFSR can be tapped and the mod 2
sum of these outputs taken to obtain a shifted version of the PRBS. The
question arises whether any specified shift can be obtained by appropriate
choice of the taps. Lemma 2 hereinbelow answers this question in the
affirmative.

Lemma 2: Let Y denote an N-stage maximal LFSR that
is initialized to a nonzero state, and let $z_i^k$, $0 \le i \le N - 1$, $k \ge 0$, denote the
output of cell i of Y at clock cycle k. For a collection of coefficients
$d_i \in \{0, 1\}$, $0 \le i \le N - 1$, define a sequence $G \triangleq \{g_0, g_1, g_2, \ldots\}$ such
that

$$g_k \triangleq \sum_{i=0}^{N-1} d_i z_i^k \pmod 2 \qquad (4)$$

Then for every $A_m \in S$, there exists a collection of
coefficients $d_i$ such that $G = A_m$.

Note: In what follows, the coefficients $d_i$ are called the
tap coefficients. $F^N$ denotes the set of N-dimensional vectors with
components 0 and 1. $F^N$ can be identified with GF($2^N$).

Proof: From Lemma 1, $G \in S$. Each collection of tap
coefficients $d = [d_0, d_1, \ldots, d_{N-1}]^T$ (T denotes transpose) is identified
with a member of $F^N$. Consider the function $\mathcal{G}: F^N \to S$ that maps (according
to Eq. (4)) each tap coefficient vector $d \in F^N$ to its corresponding sequence
$G \in S$. It will be shown that $\mathcal{G}$ is injective. Since $\mathcal{G}$ is a linear map, it is
injective if and only if it maps all nonzero points in its domain to nonzero
points in its range. Let $d'$ be a nonzero point of $F^N$. Then there exists an m
such that the mth element of $d'$ is not zero, that is, $d'_m = 1$. Since the shift
register Y is maximal, there exists a clock cycle k such that $z_m^k = 1$ and $z_i^k = 0$
for all $i \ne m$. By Eq. (4), the bit value of $G^* \triangleq \mathcal{G}(d')$ at clock cycle k is
$d'_m = 1$. Thus, $G^*$ is not the all-zero sequence. Note that $F^N$ has $2^N$
elements and S has $2^N$ sequences. Since the function $\mathcal{G}$ is injective, it
follows that $\mathcal{G}$ is surjective because its domain and range have a finite and
equal number of elements. Q.E.D.

It is thus proven that for a maximal LFSR, the $2^N - 1$
nonzero tap patterns map uniquely to the $2^N - 1$ possible shift values
$(0, 1, \ldots, 2^N - 2)$. Therefore, any shift is possible if the right tap
pattern is found, and each tap pattern can be identified with a unique shift.
These two viewpoints form the basis for the practical problem to be solved;
namely, that of generating properly shifted versions of the original bit
stream in a hardware-efficient manner.
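
The one-to-one correspondence between nonzero tap patterns and shifts can be tabulated directly for a small register. The following Python sketch is a minimal illustration under stated assumptions: a tap pattern is encoded as an integer whose bit i is the tap coefficient $d_i$, the register has 4 cells with feedback from cells 0 and 3, and the function name `shift_table` is hypothetical.

```python
def shift_table(n, feedback_cells, state=1):
    """Map every nonzero tap pattern d (bit i set = cell i tapped) to its shift.

    Assumes the register is maximal; otherwise a lookup below may fail.
    """
    mask = sum(1 << i for i in feedback_cells)
    p = 2**n - 1
    states, s = [], state
    for _ in range(p):                            # states[k] = register state at cycle k
        states.append(s)
        fb = bin(s & mask).count("1") & 1
        s = (s >> 1) | (fb << (n - 1))
    cycle_of = {st: k for k, st in enumerate(states)}
    table = {}
    for d in range(1, 2**n):
        g = 0
        for k in range(n):                        # first n bits g_0 .. g_{n-1} of G, Eq. (4)
            g |= (bin(states[k] & d).count("1") & 1) << k
        table[d] = cycle_of[g]                    # cycle whose state matches those n bits
    return table


table = shift_table(4, (0, 3))
print(sorted(table.values()) == list(range(15)))  # True: every shift occurs exactly once
print(table[0b1011])                              # 5: taps on cells 0, 1 and 3
```

In this 4-cell example, the tap arrangement used for the first bit stream of FIG. 2 (cells 0, 1, and 3) yields a shift of 5 clock cycles; for a register of a different size the shift would differ.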

The constraints due to VLSI implementation of a
neural net model are first described:

1. The bit streams should be shifted far enough apart so that the network
can settle without seeing a shifted version of a noise source in two
places. This implies close to equal spacing of the bit streams. In
practice, this constraint can be relaxed considerably or eased by
simply increasing the shift register size.

2. For performance reasons, the fan-out per cell is limited; that is, loading
any flip-flop in the register more than is necessary should be
avoided.

3. As few inputs as possible to each set of exclusive-OR gates associated
with a bit stream is desired. This reduces silicon area and improves
performance. In fact, layout simplicity may require an equal
number for all sets.

To formulate a precise problem statement, again let Y
denote a maximal N-stage LFSR, and let $p \triangleq 2^N - 1$ denote the period of the
PRBS $A_0$ generated by Y for some specified nonzero initial state. Let $\underline{L}$ be a
nonnegative integer, let $\overline{L}$, $\underline{F}$, $\overline{F}$, and P be positive integers, and let r be a
real number such that $0 \le r < 1$. Let $d_0, d_1, \ldots, d_{P-1} \in F^N$ denote a
collection of P tap coefficient vectors to be determined; and let $G_i \in S$ denote
the sequence corresponding to $d_i$, as in the proof of Lemma 2. Let $s_i$
denote the shift of $G_i$ relative to $A_0$, where $0 \le s_i < p$. Without loss of
generality, assume $s_i \le s_{i+1}$ for all $i < P - 1$. Define the shift differences $t_i$,
$0 \le i \le P - 1$, as follows:

$$t_i \triangleq \begin{cases} s_{i+1} - s_i, & \text{if } 0 \le i \le P - 2 \\ p + s_0 - s_{P-1}, & \text{if } i = P - 1 \end{cases} \qquad (5)$$

Let $u \in R^P$ denote the P-vector whose ith component is $u_i \triangleq |t_i/(p/P) - 1|$.
This is the normalized version of the shift difference. Two vectors, $l \in R^N$
and $f \in R^P$, both of which have integer-valued components, are associated
with a given collection of tap coefficient vectors $d_0, d_1, \ldots, d_{P-1} \in F^N$.
The component $l_i$ of $l$ is the number of taps connected to cell i of Y. The
component $f_i$ of $f$ is the number of 1s in (i.e., the number of cell taps
represented by) the tap vector $d_i$. Let $C: R^P \times R^P \times R^N \to [0, \infty)$ denote a
cost function. $C(u, f, l)$ is the cost associated with a collection of tap
coefficient vectors $d_0, d_1, \ldots, d_{P-1} \in F^N$. With these definitions, the
problem can be stated precisely. The implementation constraints noted
above can be restated in mathematical terms as follows:

Problem Statement: A collection of tap coefficient
vectors $d_0, d_1, \ldots, d_{P-1} \in F^N$ needs to be found that minimizes the cost
$C(u, f, l)$ subject to the following conditions:

1. $u_i = |t_i/(p/P) - 1| \le r$ for all i. The parameter r is the maximum
allowed shift variation.

2. No cell of Y has fewer than $\underline{L}$ taps or more than $\overline{L}$ taps ($\underline{L} \le l_i \le \overline{L}$ for
$0 \le i \le N - 1$). The integer $\underline{L}$ (resp., $\overline{L}$) is the aforenoted minimum
(resp., maximum) allowed cell load.

3. No tap coefficient vector $d_i$ has fewer than $\underline{F}$ components equal to 1 or
more than $\overline{F}$ components equal to 1 ($\underline{F} \le f_i \le \overline{F}$ for $0 \le i \le P - 1$).
The integer $\underline{F}$ (resp., $\overline{F}$) is the aforenoted minimum (resp.,
maximum) allowed fan-in.

Note: if an $N \times P$ matrix is formed such that column i equals vector $d_i$, then
condition 2 says that no row has fewer than $\underline{L}$ 1s or more than $\overline{L}$ 1s, and
condition 3 says that no column has fewer than $\underline{F}$ 1s or more than $\overline{F}$ 1s.

The cost function C is chosen so that minimizing it
tends to minimize the components of $l$, $f$, and $u$. Minimizing the
components of $l$ alleviates the speed degradation caused by capacitive
loading on the cells of Y. Minimizing $f$ minimizes the fan-in (number of
inputs) of the exclusive-OR gates whose outputs form the bit streams.
Clearly, $l$ and $f$ are strongly correlated (minimizing the components of one
tends to minimize those of the other). Minimizing the components of $u$
tends to keep the bit streams uniformly separated in time. The exact form
of the cost function C depends on the relative importance of minimizing
these various quantities in a particular application.
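
The quantities in the problem statement can be computed for a candidate set of tap patterns as sketched below in Python. This is a sketch under stated assumptions, not the disclosed program: tap patterns are integers whose bit i is the tap on cell i, the shifts are supplied by the caller, the last shift difference is assumed to wrap around the period, and the function names `evaluate` and `example_cost` and the weighted-sum cost are hypothetical.

```python
def evaluate(taps, shifts, n, r, load_bounds, fanin_bounds):
    """Return (u, f, l, feasible) for P tap patterns and their known shifts.

    taps         -- tap patterns as integers (bit i set = cell i tapped)
    shifts       -- shift of each pattern's stream relative to A_0
    load_bounds  -- (minimum, maximum) allowed cell load
    fanin_bounds -- (minimum, maximum) allowed fan-in
    """
    p, P = 2**n - 1, len(taps)
    s = sorted(shifts)
    # Shift differences; the final term wraps around the period (an assumption).
    t = [s[i + 1] - s[i] for i in range(P - 1)] + [p + s[0] - s[P - 1]]
    u = [abs(ti / (p / P) - 1) for ti in t]                    # normalized shift variation
    f = [bin(d).count("1") for d in taps]                      # fan-in of each bit stream
    l = [sum((d >> i) & 1 for d in taps) for i in range(n)]    # load on each cell
    feasible = (all(ui <= r for ui in u)
                and all(load_bounds[0] <= li <= load_bounds[1] for li in l)
                and all(fanin_bounds[0] <= fi <= fanin_bounds[1] for fi in f))
    return u, f, l, feasible


def example_cost(u, f, l, w_u=1.0, w_f=1.0, w_l=1.0):
    """One possible cost: a weighted sum of the worst-case components (hypothetical)."""
    return w_u * max(u) + w_f * max(f) + w_l * max(l)


# Three streams from a 4-cell register with feedback from cells 0 and 3.
u, f, l, ok = evaluate([0b0001, 0b1011, 0b1010], [0, 5, 10], 4, 0.1, (0, 2), (1, 3))
print(u, f, l, ok)
```

In this small example the three streams are spaced exactly 5 clock cycles apart, so every component of $u$ is zero, and cell 2 carries no taps, which is acceptable only because the minimum allowed cell load was set to 0.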

If the loads on the cells of the shift register or the
fan-ins of the exclusive-OR gates are of no concern, then the cost function C
does not depend on $l$ or $f$; moreover, $\underline{L}$ and $\underline{F}$ are small enough and $\overline{L}$ and $\overline{F}$
are large enough so that conditions 2 and 3 are satisfied trivially. The
problem then reduces to generating P bit streams with specified, exact time
separations. This problem has a simple analytical solution.

To see this, first note that the evolution of the shift
register's state is governed by the following equation:

$$\begin{bmatrix} z_0^{k+1} \\ z_1^{k+1} \\ \vdots \\ z_{N-1}^{k+1} \end{bmatrix} = M \begin{bmatrix} z_0^k \\ z_1^k \\ \vdots \\ z_{N-1}^k \end{bmatrix} \qquad (6)$$

where the state transition matrix M is defined as follows:

$$M \triangleq \begin{bmatrix} 0_{N-1} & I_{(N-1)\times(N-1)} \\ c_0 & c_1 \; \cdots \; c_{N-1} \end{bmatrix} \qquad (7)$$

Here $I_{(N-1)\times(N-1)}$ denotes the $(N-1) \times (N-1)$ identity matrix, $0_{N-1}$ denotes
the $(N-1)$-component all-zero column vector, and the $c_i$ are the feedback
coefficients from Eq. (1).

Lemma 3 hereinbelow says that the taps for a given shift t are obtained
explicitly by merely calculating the matrix $M^t$ and inspecting its first row.

Lemma 3: Let Y denote a maximal LFSR initialized in
a nonzero state and with state transition matrix M. Let t be a nonnegative
integer. Then the tap coefficient vector d for Y that gives a shift forward in
time by t clock cycles is the transpose of the first row of the matrix $M^t$.

Proof: Let $z^k$ denote the vector with components $z_i^k$
(cf. Eq. (6)). Let $e_0 \in F^N$ denote the column vector with 1 as its zeroth
component and 0s for the remaining N - 1 components. Then the value of
the PRBS generated by Y at clock cycle k is

$$a_k = e_0^T M^k z^0 \qquad (8)$$

For any tap coefficient vector d, the output generated
at clock cycle k is $d^T z^k = d^T M^k z^0$. If $d^T = e_0^T M^t$ is chosen, then the output at
clock cycle k is $e_0^T M^t M^k z^0$. But this is the same as $a_{k+t}$, by
Eq. (8). Q.E.D.

Lemma 3 provides a solution when the loads or fan-ins
are of no concern. Note that $M^t$ can be calculated in log(t) CPU time.
One can calculate a table containing the matrix powers
$M^{2^0}, M^{2^1}, M^{2^2}, \ldots, M^{2^{N-1}}$. Then the binary representation of t
can be used to choose the powers of M to multiply together to calculate $M^t$.
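
A minimal Python sketch of Lemma 3 follows. It builds M as in Eq. (7), raises it to the power t over GF(2) by repeated squaring (the binary representation of t selects which squared powers are multiplied in), and reads the taps from the first row. The list-of-lists matrix representation and the function names are assumptions made for illustration; a practical program would likely pack rows into machine words.

```python
def transition_matrix(n, feedback_cells):
    """M of Eq. (7): rows 0..n-2 implement the shift, the last row holds c_0..c_{n-1}."""
    m = [[1 if j == i + 1 else 0 for j in range(n)] for i in range(n - 1)]
    m.append([1 if j in feedback_cells else 0 for j in range(n)])
    return m


def matmul2(a, b):
    """Matrix product with all arithmetic taken modulo 2."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) & 1 for j in range(n)]
            for i in range(n)]


def matpow2(m, t):
    """M**t over GF(2) by square-and-multiply, using O(log t) matrix products."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]   # identity matrix
    while t:
        if t & 1:                       # this bit of t selects the current squared power
            result = matmul2(result, m)
        m = matmul2(m, m)
        t >>= 1
    return result


def taps_for_shift(n, feedback_cells, t):
    """Tap coefficients [d_0, ..., d_{n-1}] giving a shift of t clock cycles."""
    return matpow2(transition_matrix(n, feedback_cells), t)[0]


print(taps_for_shift(4, (0, 3), 5))    # [1, 1, 0, 1]: tap cells 0, 1 and 3
```

For the 4-cell register with feedback from cells 0 and 3, a shift of 5 is therefore obtained by tapping cells 0, 1, and 3, and a shift equal to the full period of 15 returns the single tap on cell 0.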

The previous special case showed that it is easy to 
calculate the taps necessary to obtain exact shifts when the load or fan-in are 
not a concern. When they are a consideration, the shifts must be allowed to 
vary from their nominal value (i.e., select a nonzero value for r) and a 
heuristic technique must be used to find a "good" set of taps. Since a fairly 
wide variance in the shift values can be allowed for this noise-generating 
application, solution candidates are abundant and a large state space may be 
searched to find a solution with acceptably low fan-in and cell load. 

The software solution implemented for this problem
can be described as follows. First, consider the set of tap patterns with K
taps for an N-cell shift register. The number of such patterns is $\binom{N}{K}$.
The set of essential K-tap patterns is defined to be the smallest subset from
which all K-tap patterns can be obtained by right-shifting a pattern from this
subset by zero or more positions. When right-shifting a pattern, zeros are
padded on the left. The set of essential patterns has only $\binom{N-1}{K-1}$ members,
or K/N times the number of total patterns. For example, the number of
2-tap patterns for a 10-cell register is $\binom{10}{2} = 45$, while there are only $\binom{9}{1} = 9$
essential patterns, viz.,

1100000000 
1010000000 
1001000000 
1000100000 
1000010000 
1000001000 
1000000100 
1000000010 
1000000001 

Note that once the bit stream shifts for the essential K-tap patterns are
found, the bit stream shifts for all other K-tap patterns can be found
trivially. For example, if the shift of 1010000000 is q, then the shift of
0001010000 is q - 3 because the latter pattern is obtained from the former
by right-shifting three bit positions.
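
The essential patterns and their right-shifted descendants can be enumerated as in the following Python sketch. Patterns are written as strings with cell N-1 on the left, matching the listing above; the function names `essential_patterns` and `right_shift` are assumptions made for this example.

```python
from itertools import combinations
from math import comb


def essential_patterns(n, k):
    """All k-tap patterns of an n-cell register whose leftmost cell (n-1) is tapped."""
    patterns = []
    for rest in combinations(range(n - 2, -1, -1), k - 1):
        cells = {n - 1, *rest}
        patterns.append("".join("1" if c in cells else "0"
                                for c in range(n - 1, -1, -1)))
    return patterns


def right_shift(pattern, j):
    """Shift a pattern right by j positions, padding zeros on the left.

    Only meaningful when the last j positions of `pattern` are zeros.
    """
    return ("0" * j + pattern)[:len(pattern)]


ess = essential_patterns(10, 2)
print(len(ess), comb(10, 2))                      # 9 45
print(right_shift("1010000000", 3))               # 0001010000

# Every 2-tap pattern is an essential pattern shifted right by some amount.
all_patterns = {right_shift(e, j) for e in ess for j in range(10) if e.endswith("0" * j)}
print(len(all_patterns))                          # 45
```

Since a right shift by j positions lowers the bit stream shift by j, only the shifts of the essential patterns need to be computed and stored.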

Let X denote the collection of all essential tap
coefficient vectors $d \in F^N$ with at least $\underline{F}$ 1s but not more than $\overline{F}$ 1s. The
number of elements in X, $|X|$, is

$$|X| = \sum_{K=\underline{F}}^{\overline{F}} \binom{N-1}{K-1}$$

(This is a polynomial in N of order $\overline{F} - 1$.) For each $d \in X$, a record is stored
in main memory that contains a representation of d along with the shift of
its corresponding sequence (see hereinbelow regarding the calculation of this
shift). Note that memory usage is greatly reduced by including only the
essential tap patterns in the set X. Simulated annealing, or any desired
random or deterministic technique, is used to search X to find a collection of
tap coefficient vectors that minimizes the cost function and satisfies
conditions 1 and 2 (condition 3 is satisfied by construction). If a solution
exists, this method will find it given enough CPU time.

In practice, it was discovered that even the set of tap coefficient
vectors with only two 1s produces shifts that are fairly well distributed
throughout the interval $[0, 2^N - 2]$. Thus, the procedure is normally
tried first with X containing just the tap coefficient vectors with
$F_{\min}$ 1s. The members of X are bucket-sorted according to the nominal
shift value to which they are closest. If all the buckets contain at least
one tap pattern, a solution is sought. If no satisfactory solution can be
found, then the tap coefficient vectors with $F_{\min} + 1$ 1s are added to
X and bucket-sorted, and the best solution is sought again. This process
(of adding a new set of tap coefficient vectors to X, then searching X for
the best solution) is continued, if necessary, until the tap coefficient
vectors with $F_{\max}$ 1s have been added to X.
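
The bucket-sorting step might look like the following sketch. It assumes
that the nominal shift values are evenly spaced at k(2^N - 1)/P for P bit
streams and that the shift axis wraps around at the sequence length; both
are assumptions made for illustration, and the function names are
hypothetical.

```python
def bucket_sort(shifts, num_streams, period):
    """Group shifts by the nearest nominal shift k * period / num_streams."""
    buckets = [[] for _ in range(num_streams)]
    for s in shifts:
        # nearest nominal index; a shift just below `period` wraps to bucket 0
        k = round(s * num_streams / period) % num_streams
        buckets[k].append(s)
    return buckets


def all_buckets_filled(buckets):
    """A solution is only sought once every bucket has a candidate."""
    return all(len(b) > 0 for b in buckets)


period = 2**25 - 1
buckets = bucket_sort([813, 977154, 16777233, 32533378], 32, period)
print([k for k, b in enumerate(buckets) if b])
print(all_buckets_filled(buckets))
```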

Finding the shift associated with each $d \in X$ can take significant CPU
time. One straightforward way to do this is a method called simple
shifting. Here an efficient representation Y of the shift register is
implemented using the word operations of the host computer. For a given
nonzero initial state of the shift register, the first N bits of the
sequence G corresponding to a given tap coefficient vector d can be
calculated easily using Y. Let $g^* \in \mathbb{F}_2^N$ denote the first N
bits of G. Note that $g^*$ represents the state of Y at the clock cycle
that equals the shift of G. Thus, starting at the given initial state
$z^0$, Y is shifted one clock cycle at a time until its state is found to
equal $g^*$. The clock cycle where this equality occurs is the shift
associated with d.
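
A minimal sketch of simple shifting follows. It is not the listing of
APPENDIX A. It assumes a register whose state is held in one integer, with
cell j in bit j, a shift toward cell 0 on each clock, and the feedback bit
entering cell N - 1; the reference stream is the output of cell 0. The
10-cell example uses feedback from cells 0 and 3, which is assumed here to
give a maximal length register. All function names are hypothetical.

```python
def parity(x):
    """Modulo 2 sum of the bits of x."""
    return bin(x).count("1") & 1


def step(state, fb_mask, n):
    """Advance the register Y by one clock cycle."""
    new_bit = parity(state & fb_mask)
    return (state >> 1) | (new_bit << (n - 1))


def first_bits(d, z0, fb_mask, n):
    """Pack the first n bits of the stream tapped by d into a state word g*."""
    g, state = 0, z0
    for i in range(n):
        g |= parity(state & d) << i
        state = step(state, fb_mask, n)
    return g


def simple_shift(d, z0, fb_mask, n):
    """Shift Y from z0 until its state equals g*; O(1) memory, O(2^n) time."""
    if d == 0:
        raise ValueError("tap pattern must be nonzero")
    target = first_bits(d, z0, fb_mask, n)
    state = z0
    for t in range(2**n - 1):
        if state == target:
            return t
        state = step(state, fb_mask, n)
    raise ValueError("state not reached; is the register maximal length?")


N, FB, Z0 = 10, 0b1001, 1
q = simple_shift(0b1010000000, Z0, FB, N)
print(q, simple_shift(0b0001010000, Z0, FB, N))
```

Under these conventions a single-cell tap on cell j has shift j, and
right-shifting a tap pattern by three positions reduces its shift by
three (modulo the sequence length).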

The simple shifting method uses O(1) (i.e., constant) memory and
$O(2^N)$ CPU time. It can exact a large time penalty for practical
problems. For example, a maximal 25-stage shift register has a sequence
length of about 34 million clock cycles. Thus, it would be expected that it
would be necessary to shift Y an average of 17 million times for each
$d \in X$. In practice, it has been found that simple shifting is too slow
for problems of "practical" size, i.e., when the shift register has more
than about 20 cells.

Faster calculation at the expense of increased main memory usage can be
obtained with a variant of what is known as Shanks' giant step/baby step
method (see, for example, D. E. Knuth, The Art of Computer Programming,
Vol. III: Sorting and Searching. Reading, MA: Addison-Wesley, 1973, p. 9).
Here, a collection of bit patterns representing the states of the shift
register at uniformly spaced clock cycle intervals is stored. Then, given a
tap coefficient vector d, the associated shift register state $g^*$ is
calculated, as was done for the simple shifting method. The shift register
representation Y is started in the state $g^*$. It is then shifted one
clock cycle at a time until its state is found to equal one of the bit
patterns stored in the table. The shift associated with d is then the shift
of the table bit pattern less the number of shifts needed to bring Y to
that state.

In more detail, this method proceeds as follows. First, a "reasonable"
giant step size h is chosen. As will be noted, a small h implies a fast
calculation of the shift for each tap pattern, but the cost in memory usage
and table setup time grows as h becomes smaller. Therefore, a compromise
value of h must be chosen. For the example of a 25-stage shift register,
h = 5000 might be chosen. Next, for all integers i such that $i \ge 0$ and
$ih \le 2^N - 2$, a hash table is filled with records, each containing the
integer ih and the bit pattern $M^{ih} z^0$ for some specified nonzero
initial state $z^0$. For the example, this means that
$v = \lfloor (2^N - 2)/h \rfloor + 1 = 6711$ bit patterns are calculated
and installed in the hash table, where $\lfloor x \rfloor$ denotes the
greatest integer not greater than x. If $E = M^h$ is initially calculated,
then the hash table building takes v matrix multiplications. Once
$w = M^{ih} z^0$ has been calculated for some value of i,
$M^{(i+1)h} z^0$ is simply $Ew$.
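
The table setup can be sketched as follows, under stated assumptions: the
state is an integer with cell j in bit j, the register shifts toward cell 0
with the feedback bit entering cell N - 1, and the transition matrix M is
stored as a list of row bit masks over GF(2). The helper names are
hypothetical and a Python dict stands in for the hash table.

```python
def parity(x):
    return bin(x).count("1") & 1


def transition_matrix(fb_mask, n):
    """Rows of M as bit masks: next-state bit i = parity(rows[i] & state)."""
    rows = [1 << (i + 1) for i in range(n - 1)]   # cell i takes cell i+1
    rows.append(fb_mask)                          # cell n-1 takes the feedback
    return rows


def mat_vec(rows, v):
    """Matrix-vector product over GF(2)."""
    return sum(parity(r & v) << i for i, r in enumerate(rows))


def mat_mul(a, b):
    """Matrix product a*b over GF(2): row i XORs the rows of b selected by a[i]."""
    out = []
    for row in a:
        acc = 0
        for j, b_row in enumerate(b):
            if (row >> j) & 1:
                acc ^= b_row
        out.append(acc)
    return out


def mat_pow(m, e):
    """m**e over GF(2) by repeated squaring."""
    result = [1 << i for i in range(len(m))]      # identity matrix
    while e:
        if e & 1:
            result = mat_mul(result, m)
        m = mat_mul(m, m)
        e >>= 1
    return result


def build_table(z0, fb_mask, n, h):
    """Map each state M^(ih) z0 to ih for all i with ih <= 2^n - 2."""
    big_step = mat_pow(transition_matrix(fb_mask, n), h)   # E = M^h
    v = (2**n - 2) // h + 1
    table, w = {}, z0
    for i in range(v):
        table[w] = i * h
        w = mat_vec(big_step, w)                           # w <- E w
    return table


table = build_table(1, 0b1001, 10, 50)
print(len(table), (2**25 - 2) // 5000 + 1)
```

Each table entry costs one matrix-vector product rather than h single
clock cycles, which is the point of computing E once.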

As in the case of the simple shifting method, let Y denote an efficient
representation of the shift register, let d denote a tap pattern, and let
$g^*$ denote the first N bits of the bit stream corresponding to d when the
initial state of Y is $z^0$. To find the shift t associated with d, Y is
initialized to the state $g^*$. Also, counter r is initialized to zero.
Then t is found as follows:

1. Look up the bit pattern that represents the state of Y in the hash table.
   If this bit pattern, which equals $M^{ih} z^0$ for some i, is in the hash
   table, set $t = ih - r$ and exit from the loop; otherwise, go to step 2.

2. Shift Y by one clock cycle and increment counter r by 1. Go to step 1.

Note that the loop will never be executed more than h times, and, on
average, it is executed h/2 times for each calculation of t. That is, the
time complexity of each t calculation is O(h). This results in a
significant savings in CPU time for each t calculation relative to the
simple shifting method. Since $v \approx 2^N/h$ bit patterns must be stored
in the hash table, the memory complexity in terms of N and h is
$O(2^N/h)$. The time required to calculate the v bit patterns in the hash
table is also proportional to v and is therefore $O(2^N/h)$. Clearly, the
value of h must be chosen based on N and the number of t calculations to
minimize the total time (setup time plus t calculation time) while keeping
the memory usage within reasonable limits.
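
The lookup loop can be sketched as below. The register conventions are
assumptions (state as an integer, cell j in bit j, shift toward cell 0,
feedback into cell N - 1), the names are hypothetical, and for brevity the
table is filled by stepping the register through one full period rather
than by the matrix method described above. The result is reduced modulo
the sequence length so that a lookup that wraps past the end of the
sequence still yields a nonnegative shift; that reduction is an addition
made in this sketch.

```python
def parity(x):
    return bin(x).count("1") & 1


def step(state, fb_mask, n):
    """Advance the register Y by one clock cycle."""
    return (state >> 1) | (parity(state & fb_mask) << (n - 1))


def first_bits(d, z0, fb_mask, n):
    """First n bits of the stream tapped by d, packed as the state g*."""
    g, state = 0, z0
    for i in range(n):
        g |= parity(state & d) << i
        state = step(state, fb_mask, n)
    return g


def build_table(z0, fb_mask, n, h):
    """States at clock cycles 0, h, 2h, ... (filled by plain stepping here)."""
    table, state = {}, z0
    for t in range(2**n - 1):
        if t % h == 0:
            table[state] = t
        state = step(state, fb_mask, n)
    return table


def table_shift(d, z0, fb_mask, n, h, table):
    """Steps 1 and 2: shift Y from g* until a tabulated state is hit."""
    if d == 0:
        raise ValueError("tap pattern must be nonzero")
    state = first_bits(d, z0, fb_mask, n)
    for r in range(h + 1):
        if state in table:                       # step 1: t = ih - r
            return (table[state] - r) % (2**n - 1)
        state = step(state, fb_mask, n)          # step 2
    raise ValueError("no table hit within h shifts")


N, FB, Z0, H = 10, 0b1001, 1, 50
table = build_table(Z0, FB, N, H)
print(table_shift(0b1010000000, Z0, FB, N, H, table))
```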

The tap-calculating procedure described above has been implemented in the
C programming language (B. W. Kernighan and D. M. Ritchie, The C
Programming Language, Prentice-Hall, Inc., 1978). For the shift registers
of interest (i.e., those having fewer than 30 stages), it was found that
the giant step/baby step algorithm was adequate for tap shift calculations.
Even after the floating-point intensive code for tap cost calculation was
optimized for efficient execution on a vector processor machine, it was
found that the CPU time bottleneck was the solution search (optimization)
step, not the tap shift calculation. For shift registers larger than 30
stages, other algorithms may be needed for tap shift calculation.

A listing of the program appears in APPENDIX A. The user inputs the number
of cells in the register, the feedback pattern, and the number of bit
streams required. Also input are the maximum and minimum allowable loading
on a cell, the maximum and minimum allowed fan-in, and the maximum allowed
shift variation. In minimizing the cost function C, weighting factors,
which are also specified by the user, are assigned to the u, l, and f
components. In addition, the user specifies the coefficients of a penalty
function used when a potential solution falls outside the specified ranges.

The program was used to derive the tap patterns for 32 bit streams as
generated from a 25-stage shift register. Since a 25-stage shift register
produces a PRBS of maximal length 33,554,431 ($2^{25} - 1$), the time
separation between bit streams is approximately one million clock cycles.
The solution is shown in TABLE I. This solution search was run with the
maximum and minimum fan-in set equal to three ($F_{\min} = F_{\max} = 3$).
The minimum cell load ($L_{\min}$) was set at three and the maximum cell
load ($L_{\max}$) was set at four. The maximum allowed shift variation was
0.4. The resulting solution had an average load per shift register cell of
3.8, with four cells having three connections and 21 cells having four
connections. The actual maximum shift variation for this solution (maximum
$u_i$ from condition 1 of the problem statement hereinabove) was 0.32. Each
tap pattern line in TABLE I indicates the cells to be tapped to produce the
bit stream having the particular shift. As an example, the first bit stream
is generated from the modulo 2 combination of the outputs of cells 9, 13,
and 21, the cells being numbered from 0 to 24 from right to left, as
hereinabove.
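
Reading a tap pattern line as a list of cell numbers can be sketched as
follows; the function name is hypothetical, and the only convention used
is the one stated above (cells numbered from 0 at the right).

```python
def tapped_cells(pattern):
    """Cell numbers tapped by a pattern string, cell 0 being the rightmost."""
    n = len(pattern)
    return sorted(n - 1 - i for i, bit in enumerate(pattern) if bit == "1")


print(tapped_cells("0001000000010001000000000"))   # first bit stream
print(tapped_cells("0000000000000000000001001"))   # feedback pattern
print((2**25 - 1) // 32)                           # nominal spacing in clock cycles
```

The first line of output gives cells 9, 13, and 21, the second gives the
feedback cells 0 and 3, and the third shows the nominal spacing of roughly
one million clock cycles.
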
TABLE I

Tap Pattern Solution

number of bit streams: 32
feedback cells: 0 and 3
feedback pattern: 0000000000000000000001001
sequence length: 33,554,431
fan-in = 3, all bit streams
maximum number of taps on a cell: 4
minimum number of taps on a cell: 3
average number of taps per cell: 3.8
maximum shift variation: 0.32

Tap Pattern                      Shift
0001000000010001000000000          813
0000100000000001000100000       977154
0000000100000000101000000      1918423
1000000001000000001000000      3065848
0000000000100000110000000      4202921
0001010000000010000000000      5199153
0000000000100000010000001      6561452
1000000000000001000000001      7319152
0000001000000000001010000      8405832
0000000000110000010000000      9153942
0000000001001000000100000     10119558
0000001000000000000000110     11177548
0100000010000000001000000     12240651
0000000110000100000000000     13588669
0000000000000010000000110     14519726
0010000000010000000001000     15641663
0000010000000000100100000     16777233
1100000000000000000001000     17769307
0001010000000000000000100     18739334
1000000010000000000000001     19774592
0000101000001000000000000     20796598
0000100100000000000000010     22115686
0000000100000000100000010     23150710
0000000001001000000010000     24450889
0000001000000100010000000     25165828
0000000000001100000000100     26245895
0010100000000010000000000     27177326
0100000000010001000000000     28164123
0010000000000010000100000     29180947
0010000010000100000000000     29900123
0100000001000000000001000     31268791
0001010000000000000010000     32533378

4444444444344444444433443     loading pattern

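
The fan-in and loading figures of TABLE I can be recomputed from the tap
patterns themselves. The sketch below is only a consistency check; the
names are hypothetical, and the three pattern lines that were printed with
letters in place of digits are entered with 0 and 1 restored.

```python
TAP_PATTERNS = [
    "0001000000010001000000000", "0000100000000001000100000",
    "0000000100000000101000000", "1000000001000000001000000",
    "0000000000100000110000000", "0001010000000010000000000",
    "0000000000100000010000001", "1000000000000001000000001",
    "0000001000000000001010000", "0000000000110000010000000",
    "0000000001001000000100000", "0000001000000000000000110",
    "0100000010000000001000000", "0000000110000100000000000",
    "0000000000000010000000110", "0010000000010000000001000",
    "0000010000000000100100000", "1100000000000000000001000",
    "0001010000000000000000100", "1000000010000000000000001",
    "0000101000001000000000000", "0000100100000000000000010",
    "0000000100000000100000010", "0000000001001000000010000",
    "0000001000000100010000000", "0000000000001100000000100",
    "0010100000000010000000000", "0100000000010001000000000",
    "0010000000000010000100000", "0010000010000100000000000",
    "0100000001000000000001000", "0001010000000000000010000",
]


def fan_ins(patterns):
    """Number of tapped cells feeding each bit stream."""
    return [p.count("1") for p in patterns]


def loading_pattern(patterns):
    """Number of bit streams tapping each cell, in the printed column order."""
    width = len(patterns[0])
    return "".join(str(sum(p[i] == "1" for p in patterns)) for i in range(width))


print(set(fan_ins(TAP_PATTERNS)))
print(loading_pattern(TAP_PATTERNS))
print(sum(fan_ins(TAP_PATTERNS)) / 25)
```

The 96 taps spread over 25 cells give the average load of about 3.8
reported in the table.
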
By using the techniques of the present invention in a CMOS implementation
of a neural network to generate plural uncorrelated analog noise sources
for annealing, the substantial cost in silicon area of the LFSR can be
amortized over many neurons while the incremental cost per neuron is
limited to some simple combinatorial logic. The function of the low-pass
filter could also be served by the frequency response of the neuron itself,
thereby saving the area cost associated with the filter. In addition to the
area advantage, a single LFSR avoids the control and synchronization
problems of multiple LFSRs.

In the example of TABLE I, the maximal length sequence becomes 32 separate
bit streams with an average separation of about one million clock cycles.
By clocking the LFSR at 100 MHz, each bit stream is separated from
repetition by its nearest neighbor by approximately 10 milliseconds. Each
bit stream is low-pass filtered to about 5 MHz. An anneal cycle of 10
microseconds would therefore have about 50 analog zero crossings. In a
network containing 32 neurons, each neuron would see noise that would not
be repeated anywhere else in the network for about 1000 anneal cycles,
which is a substantially greater separation than is required. This same
LFSR could conceivably be used for 1000 times as many neurons. For neural
network applications, therefore, shift spacing is less important for design
than fan-in or fan-out considerations.
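
The figures in this example follow from simple arithmetic, reproduced here
with integer operations. The constant names are chosen for illustration,
and the zero-crossing count simply multiplies the filter bandwidth by the
anneal time, as the text does.

```python
CLOCK_HZ = 100_000_000      # LFSR clock rate
FILTER_HZ = 5_000_000       # low-pass filter bandwidth
ANNEAL_US = 10              # anneal cycle length in microseconds
NUM_STREAMS = 32
PERIOD = 2**25 - 1          # sequence length of the 25-stage register

spacing_cycles = PERIOD // NUM_STREAMS                      # about one million
separation_us = spacing_cycles * 1_000_000 // CLOCK_HZ      # about 10 ms
crossings_per_anneal = FILTER_HZ * ANNEAL_US // 1_000_000   # about 50
anneals_before_repeat = separation_us // ANNEAL_US          # about 1000

print(spacing_cycles, separation_us, crossings_per_anneal, anneals_before_repeat)
```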

The hardware advantage of the present invention is particularly important
when such large numbers of bit streams need to be generated. In the present
invention, for a given relative shift spacing between the bit streams, the
number of cells in the shift register grows as log(P), where P is the
number of bit streams. In contrast, the hardware requirement for prior art
methods grows directly with P. In future generations of neural network
chips, it is envisioned that hundreds and perhaps thousands of bit streams
will be required. Accordingly, the hardware advantage of the present
invention will be significant.
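
The logarithmic growth can be made concrete with a short sketch. It
assumes that the register only needs a sequence length of at least P times
the desired spacing, and it ignores whether a suitable maximal length
feedback pattern and tap solution exist for that size; the function name
is hypothetical.

```python
def cells_needed(num_streams, spacing):
    """Smallest N such that 2**N - 1 >= num_streams * spacing."""
    return (num_streams * spacing).bit_length()


for p in (32, 320, 3200, 32000):
    print(p, cells_needed(p, 1_000_000))
```

Multiplying the number of bit streams by 1000 adds only about ten cells to
the register under this assumption.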

Although described in connection with providing noise sources for
stochastic neural networks, the present invention has other applications.
For example, bit-error rate testers use a pseudo-random bit stream to test
communication systems at high speed. The speed is limited by the rate at
which the shift register can be clocked. By providing multiple uncorrelated
noise sources and then multiplexing them, a new pseudo-random bit stream at
higher speed can be provided because a multiplexer can operate faster than
a shift register in a given technology. Alternatively, the bit-error rate
tester can provide multiple outputs for parallel testing, which is
generally not available in current equipment.

The above-described embodiment is illustrative of the principles of the
present invention. Other embodiments could be devised by those skilled in
the art without departing from the spirit and scope of the present
invention.






What is claimed is: 

1. A generator of plural pseudo-random bit streams comprising

a single maximal length linear feedback shift register
having a plurality of cells;

for each bit stream, means for modulo 2 combining the
tapped outputs of selected ones of said cells to produce the bit stream, the
cell outputs selected to be tapped and combined being determined so that
the bit streams are separated by predetermined shifts.

2. A generator of plural pseudo-random bit streams in
accordance with claim 1 further comprising means for low-pass filtering each
of the plural bit streams to produce plural sources of essentially Gaussian
noise.

3. A generator of plural pseudo-random bit streams comprising

a single maximal length linear feedback shift register
having a plurality of cells;

for each bit stream, means for modulo 2 combining the
tapped outputs of selected ones of said cells to produce the bit stream, the
cell outputs selected to be tapped and combined being determined so that
the maximum allowed shift variation between bit streams, the maximum
and minimum allowed fan-in, and the maximum and minimum cell load are
within predetermined limits.

4. A generator of plural pseudo-random bit streams in
accordance with claim 3 further comprising means for low-pass filtering each
of the plural bit streams to produce plural sources of essentially Gaussian
noise.

5. A stochastic element for a neural network 

comprising 

a single maximal length linear feedback shift register 
having a plurality of cells; 

means for producing plural pseudo-random bit streams 
from said single shift register by modulo 2 combining for each bit stream the 
tapped outputs of selected ones of said cells, the cell outputs selected to be 
tapped and combined being determined so that the bit streams are separated 
by predetermined shifts. 

6. A stochastic element for a neural network in
accordance with claim 5 further comprising means for low-pass filtering each 
of the plural bit streams to produce plural sources of essentially Gaussian 
noise. 

7. A stochastic element for a neural network comprising

a single maximal length linear feedback shift register
having a plurality of cells;

means for producing plural pseudo-random bit streams
from said single shift register by modulo 2 combining for each bit stream the
tapped outputs of selected ones of said cells, the cell outputs selected to be
tapped and combined being determined so that the maximum allowed shift
variation between bit streams, the maximum and minimum allowed fan-in,
and the maximum and minimum cell load are within predetermined limits.

8. A stochastic element for a neural network in 
accordance with claim 7 further comprising means for low-pass filtering each 
of the plural bit streams to produce plural sources of essentially Gaussian 
noise. 